Abstract. We present an algorithm which computes the Lempel-Ziv
factorization of a word $W$ of length $n$ on an alphabet of size $\sigma$
online in the following sense: it reads $W$ starting from the left and,
after reading each $r = O(\log_\sigma n)$ characters of $W$, updates the
Lempel-Ziv factorization. The algorithm requires $O(n \log \sigma)$ bits of
space and $O(n \log^2 n)$ time. The basis of the algorithm is a sparse
suffix tree combined with wavelet trees.



1 Introduction 

The Lempel-Ziv factorization (further LZ-factorization for short) of a word $W$
is a decomposition $W = f_1 f_2 \cdots f_z$, where a factor $f_i$, $1 \le i \le z$, is either a
character that does not occur in $f_1 f_2 \ldots f_{i-1}$ or the longest prefix of $f_i \ldots f_z$ that
occurs in $f_1 f_2 \ldots f_i$ at least twice [5][20].
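The definition can be checked on small inputs with a naive sketch. This is not the algorithm of this paper; it is a brute-force reference that uses 0-based Python indices, and the name `lz_factorize` is hypothetical. An earlier occurrence only has to start before the factor, so it may overlap the factor itself.

```python
def lz_factorize(w: str) -> list[str]:
    """Naive LZ-factorization for illustration (far from the bounds in the text)."""
    factors = []
    pos = 0  # 0-based start of the factor being computed
    n = len(w)
    while pos < n:
        length = 0
        # Extend the candidate while it still has an occurrence that starts
        # before `pos`; searching w[0:pos+length] enforces exactly that.
        while pos + length < n and w.find(w[pos:pos + length + 1], 0, pos + length) != -1:
            length += 1
        if length == 0:
            length = 1  # a character that has not occurred before
        factors.append(w[pos:pos + length])
        pos += length
    return factors


print(lz_factorize("abababb"))  # ['a', 'b', 'abab', 'b']
```

In the example the third factor `abab` starts at position 3 (1-based) and its earlier occurrence starts at position 1, so the two occurrences overlap.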

Probably the most famous application of the LZ-factorization is data
compression (e.g. the LZ-factorization is used in gzip, WinZip, and PKZIP).
Moreover, it is a basis of several algorithms [12][9] and text indexes [15].

Let $W$ be a word of length $n$ on an alphabet $\Sigma$ of size $\sigma$. There are many
algorithms that compute the LZ-factorization in $O(n \log n)$ bits of space.^1 These
algorithms use suffix trees, suffix automata [5], or suffix arrays [1][2][6][7][8][16]
as a basis.

However, only two algorithms are known that use $O(n \log \sigma)$ bits of
space [17][16]. The algorithms exploit similar ideas (both are based on an
FM-index and a compressed suffix array). The algorithm [16] is offline: it first reads
the whole string and builds the necessary data structures and then computes
the factors. The running time of this algorithm is linear.

The algorithm [17] is online. To understand the idea behind it, consider the
factors $f_1, f_2, \ldots, f_i$ of the LZ-factorization of a word $X$. The LZ-factorization
of a word $Xa$, where $a$ is a character, contains either $i$ or $i + 1$ factors: in the first
case the factors are $f_1, f_2, \ldots, f_{i-1}, f'_i$, where the last factor $f'_i = f_i a$; and in the
second case the factors are $f_1, f_2, \ldots, f_i, f_{i+1}$, where $f_{i+1} = a$. The algorithm
reads $W$ and after reading each new character updates the LZ-factorization, i.e.
either increases the length of the last factor by one or adds a new factor. The
running time of the algorithm is $O(n \log^3 n)$.
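The two cases can be illustrated with a brute-force sketch. It is not the algorithm [17], which relies on compressed indexes; it only mirrors the case distinction. The names `append_char` and `lz_factorize` are hypothetical, and the substring test is the naive Python one.

```python
def lz_factorize(w: str) -> list[str]:
    """Naive reference factorization used only to cross-check the update rule."""
    factors, pos, n = [], 0, len(w)
    while pos < n:
        length = 0
        while pos + length < n and w.find(w[pos:pos + length + 1], 0, pos + length) != -1:
            length += 1
        length = max(length, 1)
        factors.append(w[pos:pos + length])
        pos += length
    return factors


def append_char(factors: list[str], a: str) -> list[str]:
    """Return the factors of Xa given the factors of X."""
    x = "".join(factors)
    # Any occurrence of f_i + a inside X necessarily starts before f_i does,
    # so a plain substring test detects the first case.
    if factors and (factors[-1] + a) in x:
        return factors[:-1] + [factors[-1] + a]  # the last factor grows by one
    return factors + [a]  # a new factor equal to `a` is added


factors = []
for ch in "abababb":
    factors = append_char(factors, ch)
print(factors)  # ['a', 'b', 'abab', 'b']
```

Earlier factors never change because their maximality depends only on characters that have already been read.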



^1 In this paper $\log$ stands for $\log_2$.



When the input is large, it would be natural to
allow updating the LZ-factorization only after every $r > 1$ new characters of $W$, for
some small parameter $r$. Unfortunately, a naive application of this idea to the
algorithm [17] does not improve its running time.

Here we propose a new online algorithm based on a combination of a sparse
suffix tree and wavelet trees. The algorithm updates the LZ-factorization of $W$
each $r = \frac{\log_\sigma n}{4}$ characters of $W$. Our algorithm requires $O(n \log \sigma)$ bits of space
and $O(n \log^2 n)$ time.



2 Preliminaries 

Let $X$ be a word of length $|X|$ on $\Sigma$. Positions in $X$ are numbered from 1. The
subword of $X$ from position $i$ to position $j$ (inclusively) is denoted by $X[i..j]$. If
$j = |X|$, then we write $X[i..]$ instead of $X[i..j]$. A word $X[i..]$ is called a suffix
of $X$ and a word $X[1..j]$ is called a prefix of $X$.

For each word $Y$ of length $r$ on $\Sigma$ we consider a meta-character $Y'$ formed
by concatenating bit-representations of characters of $Y$. Obviously, there is a
one-to-one correspondence between meta-characters and words of length $r$ on $\Sigma$.

Note that a bit representation of any character of $Y$ can be obtained from
the bit representation of $Y'$ by two shift operations. Also, $Y'$ can be obtained
from $Y$ in $O(r)$ time by standard bit-vector operations.
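A minimal sketch of meta-characters, assuming every character has a fixed-width code of `bits` bits (the dictionary `code` and the function names are hypothetical). Python integers are unbounded, so the sketch masks after the left shift to emulate a machine word of $r \cdot$ `bits` bits; on a fixed-width word the two shifts alone would suffice.

```python
def pack(word: str, bits: int, code: dict[str, int]) -> int:
    """Concatenate the fixed-width codes of the characters of `word`."""
    value = 0
    for ch in word:
        value = (value << bits) | code[ch]
    return value


def char_at(meta: int, k: int, r: int, bits: int) -> int:
    """Code of the k-th character (0-based) of a meta-character of r characters."""
    total = r * bits
    # Left shift drops the characters before position k (mask emulates overflow).
    left = (meta << (k * bits)) & ((1 << total) - 1)
    # Right shift drops the characters after position k.
    return left >> (total - bits)


code = {"a": 0, "c": 1, "g": 2, "t": 3}
meta = pack("gatc", 2, code)
print(meta)                    # 141, i.e. 0b10001101
print(char_at(meta, 2, 4, 2))  # 3, the code of 't'
```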



3 Algorithm 

Let $f_1, f_2, \ldots, f_z$ be the factors of the LZ-factorization of $W$. For the sake of
clarity we describe an algorithm which sequentially computes $f_1, f_2, \ldots, f_z$ and
returns a position of a previous occurrence for each factor (not necessarily the
leftmost one). However, it will be easy to see how to modify this algorithm to
solve the problem we formulated in the introduction.

Let $W'$ be the meta-word formed by splitting $W$ into blocks of $r$ consecutive
characters and replacing each block with the corresponding meta-character.
Obviously, it makes no difference which word we work with, but it will be
easier to explain the algorithm in terms of $W'$.
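As a sketch of the blocking step, assuming fixed-width character codes and ignoring a trailing incomplete block (both are assumptions of this example; `meta_word` and `code` are hypothetical names):

```python
def meta_word(w: str, r: int, bits: int, code: dict[str, int]) -> list[int]:
    """Split w into blocks of r characters and pack each block into an integer."""
    usable = len(w) - len(w) % r  # assumption: a trailing partial block is skipped
    result = []
    for start in range(0, usable, r):
        value = 0
        for ch in w[start:start + r]:
            value = (value << bits) | code[ch]
        result.append(value)
    return result


code = {"a": 0, "c": 1, "g": 2, "t": 3}
print(meta_word("gatcga", 2, 2, code))  # [8, 13, 8]
```

Equal blocks of $W$ map to equal meta-characters, so occurrences aligned to block borders in $W$ correspond to occurrences in $W'$.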

Suppose that $f_1, f_2, \ldots, f_{i-1}$ of total length $\ell_i$ have been computed. The
algorithm consists of two procedures. The procedure $P_{<r}$ checks if $|f_i|$ is less
than $r$ and, if it is, computes $f_i$ (Section 3.2). The procedure $P_{\ge r}$ computes
$f_i$ only if it is already known that $|f_i| \ge r$ (Section 3.3). To compute $f_i$ the
algorithm runs $P_{<r}$ first and then, if necessary, runs $P_{\ge r}$.



3.1 Data Structures 

The algorithm maintains two dynamic data structures, updating them
immediately after reading a new character of $W'$. The procedure $P_{<r}$ uses the first one
and the procedure $P_{\ge r}$ uses the second one.



After reading $W'[1..t]$ the first data structure is a compacted trie on suffixes
of words $W[rj + 1..r(j + 2)]$, $j = 0..t - 2$. Each explicit vertex $v$ of the trie stores
a starting position of one of the suffixes ending in the subtree rooted at $v$.

The second one is an implicit suffix tree for $W'[1..t]$. This tree is also called a
sparse suffix tree for $W[1..tr]$ [3][10][4], though the original definition of a sparse
suffix tree is slightly different [11].

For each explicit vertex $v$ of the suffix tree we store a compacted trie $CT_v$
on words of length $r$ corresponding to the first meta-characters on the edges
outgoing from $v$.

Definition 1. Consider a tree with labels on edges (a suffix tree or a trie). We
say that a word $X$ is represented by a vertex $v$ (or that $v$ represents $X$), if the
path from the root of the tree to $v$ is labelled by $X$.

If the label of an edge $(v, u)$ of the suffix tree begins with a meta-character
$Y'$, and $Y$ is the corresponding word of length $r$, then we store a pointer to $(v, u)$
in the leaf of $CT_v$ representing $Y$. The tries in vertexes are used for navigation in
the suffix tree (but not only for it). Clearly, given a vertex and a meta-character,
it takes $O(r)$ time to find the edge outgoing from the vertex, the label of which
starts with the meta-character.

The suffix tree is updated by Ukkonen's algorithm [19].

Definition 2. Block borders are positions of $W$ of the form $pr + 1$, where $p$ is
an integer in the interval $[1, \lfloor \frac{n}{r} \rfloor]$.

Let $B_v$ be a set of block borders corresponding to the starting positions of the
suffixes represented by the leaves of the subtree rooted at an explicit vertex $v$. We
store an additional data structure which allows, given $v$, a word $Y$ on $\Sigma$ of length at most $r$, and
a block border $b$, to determine whether $B_v \setminus \{b\}$ contains a block border preceded
by an occurrence of $Y$. If there are such block borders, the data structure reports
one of them. The query takes $O(\log^2 n)$ time.

Details of implementation are not important to understand the algorithm
and will be explained later, in Section 4.

Hereafter $\lfloor \frac{\ell_i}{r} \rfloor$ is denoted by $\ell'_i$. We assume that the algorithm has read
$W'[1..\ell'_i + 1]$ before running the procedures $P_{<r}$ and $P_{\ge r}$.

3.2 Procedure $P_{<r}$

Lemma 1. $W[\ell_i + 1..\ell_i + r]$ occurs in the words $W[rj + 1..r(j + 2)]$, $j = 0..\ell'_i - 1$,
iff $|f_i| \ge r$ (see Fig. 1).

We traverse the trie starting at the root and following edges labelled with
the characters of $W[\ell_i + 1..\ell_i + r]$. Two cases are possible: either we will read
out the whole word $W[\ell_i + 1..\ell_i + r]$ or we will stop after reading $W[\ell_i + 1..s]$,
$s < \ell_i + r$, and will not be able to proceed. It follows from the lemma that in
the first case $|f_i| \ge r$ and in the second case $|f_i| < r$. Moreover, it is easy to see
that in the second case $|f_i|$ is equal to $|W[\ell_i + 1..s]|$ and that we can report a
previous occurrence of $f_i$ in $O(1)$ time.

Obviously, $P_{<r}$ takes $O(|f_i|)$ time in both cases.
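The reason overlapping words of length $2r$ taken with step $r$ are enough to find short occurrences can be checked with a brute-force sketch. It uses Python sets instead of the compacted trie and 0-based slices instead of the 1-based positions of the text; the function names are hypothetical. It verifies only the covering property (every subword of length at most $r$ of $W[1..tr]$ lies inside some window), not the exact index range of Lemma 1.

```python
def windows(w: str, r: int) -> list[str]:
    """Words W[rj+1..r(j+2)] for j = 0..t-2, written as 0-based slices."""
    t = len(w) // r
    return [w[r * j: r * (j + 2)] for j in range(t - 1)]


def short_substrings(text: str, max_len: int) -> set[str]:
    """All non-empty subwords of `text` of length at most `max_len`."""
    return {text[i:i + m]
            for i in range(len(text))
            for m in range(1, max_len + 1)
            if i + m <= len(text)}


def substrings_via_windows(w: str, r: int) -> set[str]:
    """Subwords of length at most r that occur inside at least one window."""
    found = set()
    for win in windows(w, r):
        found |= short_substrings(win, r)
    return found


w, r = "abracadabra!", 3
print(substrings_via_windows(w, r) == short_substrings(w, r))  # True
```

A subword of length at most $r$ that starts in block $j$ ends no later than the end of block $j + 1$, which is why two consecutive blocks per window suffice.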
Fig. 1. Case $|f_i| \ge r$, $r = 4$. The word $W[rk + 1..r(k + 2)]$, $k < \ell'_i$, containing an
occurrence of $W[\ell_i + 1..\ell_i + r]$ is highlighted in grey. Block borders are in bold.

Fig. 2. Relation between $W'[\ell'_i..s - 1]$ and $W[\ell_i + 1..r(s - 1)]$.


3.3 Procedure $P_{\ge r}$

$P_{\ge r}$ consists of two steps. The first one is preliminary: during it we only read
$W'$ and update the data structures. During the second step we compute $f_i$.

The First Step. $P_{\ge r}$ starts with reading $W'$ until a position $s$ such that in
the suffix tree of $W'[1..s]$ the suffix $W'[\ell'_i..s]$ is represented by a leaf. From the
description of Ukkonen's algorithm [19] it follows that all suffixes starting at
positions less than $\ell'_i$ will be represented by leaves.

Lemma 2. During this step we read at most $\lceil \frac{|f_i|}{r} \rceil$ characters of $W'$.

Proof. Since $s$ is the minimal position such that $W'[\ell'_i..s]$ is represented by a leaf,
$W'[\ell'_i..s - 1]$ is represented by an inner vertex in the suffix tree of $W'[1..s - 1]$ and,
consequently, occurs before the position $\ell'_i$ in $W'$. Therefore, $W[\ell_i + 1..(s - 1)r]$
occurs before the position $\ell_i$ (see Fig. 2) and $|f_i| \ge |W[\ell_i + 1..(s - 1)r]| \ge
r|W'[\ell'_i + 1..s - 1]|$. Inverting the inequality, we obtain the desired result.

We initialize $M$ with $|W[\ell_i + 1..(s - 1)r]|$. The lemma guarantees that $|f_i| \ge M$.
During the computation process we will increase $M$ until, finally, it becomes
equal to $|f_i|$.

Definition 3. Depth of a vertex v of the suffix tree is the length of the word 
represented by v. 

Lemma 3. Let $v$ be an explicit inner vertex of the suffix tree of $W'[1..s]$ with
depth at least $\lfloor \frac{M}{r} \rfloor$. If a block border belongs to the set $B_v$, then it is not bigger
than $(\ell'_i + 1)r + 1$.



Proof. Indeed, a subtree rooted at $v$ can only contain leaves representing suffixes
of length at least $\lfloor \frac{M}{r} \rfloor = s - \ell'_i$, and all such suffixes start at positions $\le \ell'_i + 1$.
The statement immediately follows.

We will read a new character of $W'$ and update the data structures only
when $\frac{\ell_i + M}{r}$ is bigger than the position of the last character of $W'$ we have read.
This will guarantee that a statement similar to the statement of the lemma will
be true throughout $P_{\ge r}$.

The Second Step. Consider the first block border which intersects a previous
occurrence of $f_i$ (see Fig. 3). It divides the occurrence into two parts: the first
short part equal to $W[\ell_i + 1..\ell_i + k - 1]$ and the second part equal to a prefix of
$W[\ell_i + k..]$, $k \in [1, r]$.

Let $f_i^k$ be the longest prefix of $W[\ell_i + k..]$ with at least one occurrence at
a block border which is less than $\ell_i$ and preceded by an occurrence of $W[\ell_i +
1..\ell_i + k - 1]$. Obviously, $|f_i| = \max_{k \in [1, r]} (|f_i^k| + k - 1)$.

For each $k = 1..r$ the procedure $P_{\ge r}$ either computes $|f_i^k|$ and updates $M$ or
proves that $|f_i^k| + k - 1 \le M$ and starts computation of $f_i^{k+1}$.

Let $W'_{\ell_i + k}$ be a meta-word formed by blocking every $r$ characters of $W[\ell_i + k..]$
into a single meta-character. Note that each character of $W'_{\ell_i + k}$ can be obtained
by at most two shift operations from appropriate characters of $W'$, therefore
there is no need to compute $W'_{\ell_i + k}$ in advance or to store it explicitly.

Remark 1. If a path from the root of the suffix tree to a vertex $v$ is labelled by
$W'_{\ell_i + k}[1..m]$, then $W[\ell_i + k..\ell_i + k + mr]$ occurs at all block borders of $B_v$.

This remark gives us the idea of how $|f_i^k|$ can be computed. We traverse
the suffix tree starting at the root and following the edges labelled with the
characters of $W'_{\ell_i + k}$. Let $v$ be an explicit vertex representing a word $W'_{\ell_i + k}[1..p]$,
where $p > \frac{M - k + 1}{r}$. Clearly, $|f_i^k| + k - 1 \ge pr + k - 1$ iff $B_v$ contains a block border less
than $\ell_i$ preceded by an occurrence of $W[\ell_i + 1..\ell_i + k - 1]$.

Since the depth of $v$ is equal to $p \ge \lfloor \frac{M - k + 1}{r} \rfloor + 1 \ge \lfloor \frac{M}{r} \rfloor$, $B_v$ can contain
only one block border bigger than $\ell_i$, namely, $(\ell'_i + 1)r + 1$ (Lemma 3). So,
$|f_i| \ge pr + k - 1$ iff $B_v \setminus \{(\ell'_i + 1)r + 1\}$ contains a block border preceded by
an occurrence of $W[\ell_i + 1..\ell_i + k - 1]$, and this is exactly the type of question
the additional data structure for the suffix tree can answer (see Section 4).
If there is such a border, the algorithm updates $M$ to $pr + k - 1$ and proceeds.
If not, then the algorithm starts computation of $f_i^{k+1}$.



Fig. 3. A previous occurrence of $f_i$. The part equal to $W[\ell_i + 1..\ell_i + k - 1]$ ($k = 4$)
is highlighted in grey.



This completes the description of the second step of the algorithm. However,
several technical difficulties remain.

Technical Difficulties of the Second Step. First of all, Lemma 3 works
only for inner vertexes. If during the traversal we arrive at a leaf of the suffix
tree, we first check if this leaf corresponds to a block border less than $\ell_i$ and
then check if the border is preceded by an occurrence of $W[\ell_i + 1..\ell_i + k - 1]$
using a character-by-character comparison. After that, the algorithm proceeds
as described earlier.

Secondly, suppose that during the traversal we stop in a vertex $v$ representing
a word $W'_{\ell_i + k}[1..p]$ and cannot continue reading. This means that $W[\ell_i +
k..(p + 1)r]$ does not occur at block borders in the prefix of $W$ corresponding
to the prefix of $W'$ which has been read. However, this can be false for a word
$W[\ell_i + k..pr + q]$, $q < r$. The next two paragraphs explain how to find the biggest
$q < r$ such that a word $W[\ell_i + k..pr + q]$ occurs at a block border preceded by
an occurrence of $W[\ell_i + 1..\ell_i + k - 1]$.

Two cases are possible depending on whether $v$ is implicit or explicit. Let $v$
be implicit and $u$ be the lower end of the edge containing $v$. To find $q$, we first
ask if $B_u \setminus \{(\ell'_i + 1)r + 1\}$ contains a block border preceded by an occurrence
of $W[\ell_i + 1..\ell_i + k - 1]$. If it does, we compare the word corresponding to the
next meta-character on the edge with the word corresponding to the next
meta-character of $W'_{\ell_i + k}$ character by character to find the length of their longest
common prefix, which obviously will be equal to $q$.

Suppose now that $v$ is an explicit vertex. We traverse $CT_v$ starting at the
root and following the edges labelled with the characters of $W[pr + 1..(p + 1)r]$.
Let $u$ be an explicit vertex of $CT_v$ representing a word $W[pr + 1..pr + t]$, where
$pr + t + k - 1 > M$. Suppose that $u_1, u_2, \ldots, u_h$ are the sons of $v$ corresponding to
the leaves of the subtree of $CT_v$ rooted at $u$. Obviously, a word $W[\ell_i + k..pr + t]$
occurs at all block borders in the set $B_{u_1} \cup B_{u_2} \cup \ldots \cup B_{u_h}$. Moreover, the set
$B_{u_1} \cup B_{u_2} \cup \ldots \cup B_{u_h}$ can contain only one block border bigger than $\ell_i$, namely,
$(\ell'_i + 1)r + 1$ (Lemma 3).

In each vertex $u$ with such properties we ask whether the set $B_{u_1} \cup B_{u_2} \cup
\ldots \cup B_{u_h} \setminus \{(\ell'_i + 1)r + 1\}$ contains a block border preceded by $W[\ell_i + 1..\ell_i + k - 1]$.
From the description of the additional data structure we store for the suffix tree
(Section 4) it will be clear that such queries also can be answered in $O(\log^2 n)$
time.

It is important that after asking the additional data structure we either
increase $M$ or proceed to the computation of $f_i^{k+1}$. Therefore, there will be no
more than $r + |f_i|$ such queries during the second step.

4 Data Structures in Details 

As we have already said, our algorithm maintains two data structures. In this 
section we give the details and describe update procedures. 



4.1 Trie 



After reading $W'[1..t]$ the trie contains suffixes of words $W[rj + 1..r(j + 2)]$,
$j = 0..t - 2$. To update the trie after reading $W'[t + 1]$ we first check if
$W[r(t - 1) + 1..r(t + 1)]$ is represented in the trie. To do that we traverse the trie starting
at the root and following edges labelled with the characters of
$W[r(t - 1) + 1..r(t + 1)]$. If we read out the whole word, then $W[r(t - 1) + 1..r(t + 1)]$ and,
consequently, all its suffixes are represented in the trie. If not, we add all suffixes
of $W[r(t - 1) + 1..r(t + 1)]$, including the word itself, to the trie.

Lemma 4. The trie occupies $o(n)$ bits and its maintenance takes $O(n)$ time.

Proof. Due to our choice of $r$, there are at most $\sigma^{2r} = \sigma^{\frac{\log_\sigma n}{2}} = \sqrt{n}$ different
words of length $2r$ on $\Sigma$. Therefore, the trie has $O(\sqrt{n}\, r) = o(n)$ leaves and the
space bound is proved.

To check if the words $W[rj + 1..r(j + 2)]$, $j = 0..t - 2$, are represented in
the trie one needs $O(n)$ time in total. As there are at most $\sqrt{n} < \frac{n}{r^2}$ different
words of length $2r$, we add suffixes of less than $\frac{n}{r^2}$ words. All suffixes of a word
of length $2r$ can be added to the trie in $O(r^2)$ time, so we get the linear time
bound.

Finally, suppose that we create a new vertex $v$ in the process of adding a
suffix $W[p..q]$ of the word $W[q - 2r + 1..q]$ to the trie. Then we just remember
the position $p$ as a starting position of a suffix ending in the subtree rooted at
$v$. This completes the description of the update procedure.

4.2 Suffix Tree 

The suffix tree is updated by Ukkonen's algorithm [19]. When we create a new
edge outgoing from a vertex $v$ with the first character of the label equal to $W'[j]$,
we add $W[(j - 1)r + 1..jr]$ to $CT_v$. Obviously, this step takes $O(r)$ time.

Below we describe the additional data structure which allows, given an
explicit vertex $v$, a word $Y$ on $\Sigma$ of length at most $r$, and a block border $b$, to determine whether
$B_v \setminus \{b\}$ contains a block border preceded by an occurrence of $Y$ in $O(\log^2 n)$
time.

We define a meta-character $c_{min}$ as follows: reverse the bit representation of
$Y$ and then append $(r - |Y|) \log \sigma$ zeros to it. A meta-character $c_{max}$ is defined in
a similar way, but ones are appended instead of zeros. Obviously, a block border
$pr + 1$ is preceded by an occurrence of $Y$ iff the reverse of the bit representation
of $W'[p - 1]$ lies in the interval $[c_{min}, c_{max}]$.
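The interval test can be sketched as follows, assuming each character has a fixed-width code of `b` bits, so that $\log \sigma$ in the text corresponds to `b` here. The function names and the dictionary `code` are hypothetical. The block passed to `reversed_value` stands for the $r$ characters immediately preceding a block border.

```python
def bits_of(word: str, code: dict[str, int], b: int) -> str:
    """Concatenated b-bit codes of the characters of `word`, as a bit string."""
    return "".join(format(code[ch], "0%db" % b) for ch in word)


def reversed_value(block: str, code: dict[str, int], b: int) -> int:
    """Integer value of the reversed bit representation of a block of r characters."""
    return int(bits_of(block, code, b)[::-1], 2)


def interval(y: str, r: int, code: dict[str, int], b: int) -> tuple[int, int]:
    """[c_min, c_max]: reversed bits of y padded with zeros, resp. ones."""
    pad = (r - len(y)) * b
    prefix = bits_of(y, code, b)[::-1]
    return int(prefix + "0" * pad, 2), int(prefix + "1" * pad, 2)


code = {"a": 0, "c": 1, "g": 2, "t": 3}
c_min, c_max = interval("ga", 3, code, 2)
print(c_min <= reversed_value("tga", code, 2) <= c_max)  # True: "tga" ends with "ga"
print(c_min <= reversed_value("gat", code, 2) <= c_max)  # False
```

Reversal turns "ends with $Y$" into "starts with the reversed bits of $Y$", and a fixed prefix of a fixed-length bit string corresponds to a contiguous range of integers, which is what makes a range query on the wavelet tree applicable.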

Let $p_i$ be the starting position of the suffix represented by the $i$-th leaf in
the left-to-right order on the leaves of the suffix tree. Consider virtual sequences
GBWT, GBWT[$i$] equal to the reverse of the bit representation of $W'[p_i - 1]$,
and $B$, $B[i]$ equal to the block border $p_i r + 1$. We store GBWT and $B$ in dynamic
wavelet trees (Theorem 9 [11]). Note that $\sigma$ in Theorem 9 denotes the size of
the alphabet of the sequence, i.e. $\log \sigma = O(\log n)$ for GBWT and $B$. Assuming
$g = 2$, updates of the wavelet trees cost $O(\log^2 n)$ time.



Let $l(v)$ and $r(v)$ be the minimal and the maximal ranks of leaves of the
subtree rooted at $v$ in the left-to-right order on the leaves of the suffix tree.
Then $B_v = \{B[k] \mid k \in [l(v), r(v)]\}$ and the subset $B_v^Y \subseteq B_v$ of block borders
preceded by $Y$ can be defined as $B_v^Y = \{B[k] \mid k \in [l(v), r(v)]$ and GBWT[$k$] $\in
[c_{min}, c_{max}]\}$. Clearly, $B_v \setminus \{b\}$ contains a block border preceded by an occurrence
of $Y$ iff $B_v^Y$ contains a block border different from $b$.

From the description of the dynamic wavelet trees and discussions in [15] it
follows that each block border belonging to $B_v^Y$ can be retrieved in $O(\log^2 n)$
time. Obviously, it is enough to retrieve at most two block borders to determine
whether $B_v^Y$ contains a block border different from $b$.

It remains to show how the minimal and the maximal ranks of the leaves in 
the subtree rooted at v can be computed. 

First we describe a static data structure. Consider a bit vector of length
$m < 4n$ with zeros in positions corresponding to the visits of inner vertexes and
ones in the positions corresponding to the visits of leaves in the Euler tour of
the suffix tree. We store the vector in a balanced binary tree, whose leftmost
leaf contains the first $\log m$ bits of the vector, the second leftmost leaf contains the
next $\log m$ bits, and so on. In each vertex of the balanced binary tree we store
the number of bits set in the subtree rooted at this vertex.

Consider the leaf of the balanced binary tree storing the bit corresponding
to the first visit of an explicit vertex $v$ of the suffix tree in the Euler tour. For
each $v$ we store a pointer to the leaf and the number of the bit in the segment
of the vector stored in the leaf. We also store similar information connected
with the last visit of $v$.

The minimal rank of a leaf in the subtree rooted at $v$ is the number of bits
set before the pointer corresponding to the first visit of $v$ plus one. This number
can be computed in $O(\log m)$ time by scanning the segment of the bit vector
which contains the bit corresponding to the first visit of $v$ and then going along
the path from the leaf to the root, summing up the number of bits set to the left
of this path. The maximal rank can be computed in the same way.
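The rank computation can be illustrated with a static sketch that replaces the balanced binary tree by plain prefix counting, so it shows only what is computed, not the $O(\log m)$ bound. The tree encoding (a dictionary `children` from a vertex to its ordered list of sons) and the function names are hypothetical.

```python
def euler_bits(children: dict, root):
    """One bit per visit in the Euler tour: 0 for inner vertexes, 1 for leaves.

    Also returns the index of the first and of the last visit of every vertex.
    """
    bits, first, last = [], {}, {}

    def visit(v):
        first.setdefault(v, len(bits))
        last[v] = len(bits)
        bits.append(0 if children.get(v) else 1)

    def walk(v):
        visit(v)
        for c in children.get(v, []):
            walk(c)
            visit(v)  # the tour returns to v after each son

    walk(root)
    return bits, first, last


def leaf_rank_range(bits, first, last, v):
    """Minimal and maximal rank (1-based) of the leaves below v."""
    lo = sum(bits[:first[v]]) + 1   # bits set before the first visit, plus one
    hi = sum(bits[:last[v] + 1])    # bits set up to the last visit
    return lo, hi


children = {0: [1, 4], 1: [2, 3]}  # leaves 2, 3, 4 in left-to-right order
bits, first, last = euler_bits(children, 0)
print(bits)                                   # [0, 0, 1, 0, 1, 0, 0, 1, 0]
print(leaf_rank_range(bits, first, last, 1))  # (1, 2)
print(leaf_rank_range(bits, first, last, 4))  # (3, 3)
```

In the dynamic setting the two prefix sums would be answered by the counters stored in the balanced binary tree instead of `sum`.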

This data structure is a slight modification of the one proposed in [13]
(Section 3.3). We modify it only because of the pointers. We make it dynamic
exactly like Navarro and Mäkinen do. During the update process only $O(\log m)$
bits change their positions in leaves and therefore only $O(\log m)$ pointers need
to be updated. Whenever $\log m$ changes, we perform a global rebuild (obviously,
this will result in $O(1)$ additional amortized time per insertion).

To update the wavelet trees after adding a new leaf to the suffix tree, we only 
need to know the rank of this leaf in the left-to-right order on the leaves of the 
suffix tree, and it can be computed in O(logn) time using the data structure we 
have just described. 

Lemma 5. The suffix tree and additional structures occupy $O(n \log \sigma)$ bits and
their maintenance takes $O(n \log n)$ time.

Proof. The suffix tree has at most $\frac{n}{r}$ leaves and therefore occupies $O(n \log \sigma)$
bits. Tries in vertexes have $O(\frac{n}{r})$ leaves in total and occupy $O(n \log \sigma)$ bits as
well. The dynamic bit vector uses $O(\frac{n}{r})$ bits of space and the dynamic wavelet
tree uses $O(\frac{n}{r} \log n) = O(n \log \sigma)$ bits of space.

Ukkonen's algorithm [19] takes $O(\frac{n}{r} r)$ time (the additional $r$ appears because of
the cost of navigation). To update tries in vertexes we need $O(\frac{n}{r} r) = O(n)$ time.
All wavelet tree updates take $O(\frac{n}{r} \log^2 n) = O(n \log n)$ time. And finally, the update
of the balanced binary tree and the pointers takes $O(\frac{n}{r} \log n) = O(n)$ time.

5 Results and Conclusions 

To conclude, we prove the following theorem. 

Theorem 1. The presented algorithm computes the Lempel-Ziv factorization of
a word $W$ in $O(n \log^2 n)$ time and $O(n \log \sigma)$ bits of space.

Proof. Lemmas 4 and 5 guarantee that the data structures occupy $O(n \log \sigma)$
bits of space in total and that their maintenance takes $O(n \log n)$ time.

To compute $f_i$, first $P_{<r}$ is run. As we have proved, it takes $O(|f_i|)$ time.
We show now that $P_{\ge r}$ takes $O(|f_i| \log^2 n + \log^3 n)$ time. Indeed, no more than
$r + |f_i|$ queries are asked to the wavelet trees, each of them taking $O(\log^2 n)$ time.
To follow down in the suffix tree we need $O(\frac{|f_i|}{r} r) = O(|f_i|)$ time (remember
that transition from a vertex to its son takes $O(r)$ time). Therefore, the total
time spent during $P_{\ge r}$ is $O((r + |f_i|) \log^2 n + r|f_i|) = O(|f_i| \log^2 n + \log^3 n)$. We
run $P_{\ge r}$ only when $|f_i| \ge r$; obviously, this happens no more than $\frac{n}{r}$ times. This
completes the proof.

Remark 2. It is easy to see that the described algorithm can be implemented as
an online algorithm with the same running time and working space.

We have presented a new online algorithm for computation of the Lempel-Ziv
factorization based on a combination of a sparse suffix tree and wavelet trees.
The algorithm has better running time than the only previously known online
algorithm for Lempel-Ziv factorization with linear working space [17], although,
as we should mention, the time per character in our algorithm can be worse than
in the algorithm [17].

The algorithm is based on interesting techniques which we believe might open
new possibilities for computations on sparse suffix trees. In particular, we show
that the combination of sparse suffix trees and wavelet trees, which has been
previously used only in static settings, can be built online.

A challenging problem for future research is to reduce the running time of
our algorithm, which can probably be achieved by using other data structures
instead of the wavelet trees.